OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A 300 MB Turkish Corpus and Word Analysis

Internal identifier: 001950 (Main/Exploration); previous: 001949; next: 001951

Authors: Gökhan Dalkilic [Turkey]; Yalcin Cebi [Turkey]

RBID : ISTEX:8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290

Abstract

To determine certain properties of a language, a corpus of that language should be created. To analyze the Turkish language, a Turkish corpus of ~300 MB containing more than 44 million words was first prepared from 10 different websites with Turkish content. Statistics on the most frequently used Turkish words were calculated from this corpus. The frequencies of the 7 most frequently used words were compared with those of their equivalents in English, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of 1 to 5 letters were determined and applied to a randomly selected text to test the validity of the process.
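The word-frequency analysis summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`turkish_lower`, `tokenize`, `most_frequent`, `coverage`), the letter-only tokenization rule, and the toy sentence are all assumptions made for illustration, and the "coverage" measure is only one plausible reading of applying the frequent words to a randomly selected text.

```python
import re
from collections import Counter

# Runs of Unicode letters only (no digits, no underscore). This tokenization
# rule is an assumption; the abstract does not say how words were delimited.
WORD_RE = re.compile(r"[^\W\d_]+")


def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless I mapping.

    Python's str.lower() is not locale-aware, so the two Turkish capital
    I letters are mapped explicitly before lowercasing the rest.
    """
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text):
    """Split text into lowercase words made of letters only."""
    return WORD_RE.findall(turkish_lower(text))


def most_frequent(words, n, min_len=1, max_len=None):
    """Return the n most frequent words whose length is within the bounds."""
    counts = Counter(
        w for w in words
        if len(w) >= min_len and (max_len is None or len(w) <= max_len)
    )
    return counts.most_common(n)


def coverage(words, vocabulary):
    """Fraction of tokens in `words` that appear in `vocabulary`."""
    if not words:
        return 0.0
    vocab = set(vocabulary)
    return sum(1 for w in words if w in vocab) / len(words)


if __name__ == "__main__":
    # Toy sentence invented for this sketch; it is not taken from the corpus.
    sample = "Bir gün bir adam ve bir kadın bu yolda yürüdü ve bu şehre geldi"
    tokens = tokenize(sample)
    top = most_frequent(tokens, 3, min_len=1, max_len=5)
    print(top)
    print(coverage(tokens, [word for word, _ in top]))
```

On a real corpus, `tokenize` would be applied file by file and the counts accumulated in a single `Counter`, so that the ~300 MB of text never has to be held in memory at once.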

DOI: 10.1007/3-540-36077-8_20



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A 300 MB Turkish Corpus and Word Analysis</title>
<author>
<name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</author>
<author>
<name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1007/3-540-36077-8_20</idno>
<idno type="url">https://api.istex.fr/document/8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001A17</idno>
<idno type="wicri:Area/Istex/Curation">001912</idno>
<idno type="wicri:Area/Istex/Checkpoint">001064</idno>
<idno type="wicri:doubleKey">0302-9743:2002:Dalkilic G:a:mb:turkish</idno>
<idno type="wicri:Area/Main/Merge">001A30</idno>
<idno type="wicri:Area/Main/Curation">001950</idno>
<idno type="wicri:Area/Main/Exploration">001950</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">A 300 MB Turkish Corpus and Word Analysis</title>
<author>
<name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Turquie</country>
<wicri:regionArea>Computer Engineering Dept., Dokuz Eylul University, 35100, Bornova, Izmir</wicri:regionArea>
<wicri:noRegion>Izmir</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Turquie</country>
</affiliation>
</author>
<author>
<name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
<affiliation wicri:level="1">
<country xml:lang="fr">Turquie</country>
<wicri:regionArea>Computer Engineering Dept., Dokuz Eylul University, 35100, Bornova, Izmir</wicri:regionArea>
<wicri:noRegion>Izmir</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Turquie</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2002</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290</idno>
<idno type="DOI">10.1007/3-540-36077-8_20</idno>
<idno type="ChapterID">20</idno>
<idno type="ChapterID">Chap20</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: In order to determine some properties of a language, a corpus of that language should be created. To analyze Turkish language, at first, a Turkish corpus having ~300 MB capacity and more than 44 million words was prepared by using 10 different web sites having Turkish content. Most frequently used word statistics of Turkish were calculated by using this corpus. Frequencies of most frequently used first 7 words were compared with their equivalent in English, and it was found out that most frequently used words are not nouns in natural languages. Most frequently used words having 1 to 5 letters were determined and they were applied onto a randomly selected text in order to test the validity of the process.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Turquie</li>
</country>
</list>
<tree>
<country name="Turquie">
<noRegion>
<name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</noRegion>
<name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
<name sortKey="Cebi, Yalcin" sort="Cebi, Yalcin" uniqKey="Cebi Y" first="Yalcin" last="Cebi">Yalcin Cebi</name>
<name sortKey="Dalkilic, Gokhan" sort="Dalkilic, Gokhan" uniqKey="Dalkilic G" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</country>
</tree>
</affiliations>
</record>
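As a rough illustration of how the bibliographic fields could be read from a record like this one, here is a minimal Python sketch using the standard library. It works on a trimmed excerpt rather than the full record, because the record as displayed uses the `wicri:` prefix without declaring it, which a namespace-aware parser would likely reject. The `summarize` function and the `RECORD_EXCERPT` constant are hypothetical names introduced for this example and are not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the record, limited to elements and attributes that do
# not use the undeclared `wicri:` prefix (an assumption made so that a
# standard XML parser accepts the input).
RECORD_EXCERPT = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A 300 MB Turkish Corpus and Word Analysis</title>
<author>
<name sortKey="Dalkilic, Gokhan" first="Gökhan" last="Dalkilic">Gökhan Dalkilic</name>
</author>
<author>
<name sortKey="Cebi, Yalcin" first="Yalcin" last="Cebi">Yalcin Cebi</name>
</author>
</titleStmt>
<publicationStmt>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1007/3-540-36077-8_20</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""


def summarize(record_xml):
    """Extract title, authors, year, and DOI from a record excerpt."""
    root = ET.fromstring(record_xml)
    file_desc = root.find("./TEI/teiHeader/fileDesc")
    title_stmt = file_desc.find("titleStmt")
    pub_stmt = file_desc.find("publicationStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "year": pub_stmt.find("date").get("when"),
        "doi": pub_stmt.findtext("idno[@type='doi']"),
    }


if __name__ == "__main__":
    for field, value in summarize(RECORD_EXCERPT).items():
        print(f"{field}: {value}")
```

To process the full record this way, one option would be to add a namespace declaration for `wicri` on the root element before parsing; the namespace URI to use is not given on this page.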

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001950 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001950 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:8865D6FEDBD0E4A01B9DB58C91A4FF9F344ED290
   |texte=   A 300 MB Turkish Corpus and Word Analysis
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024